Abstract. Relating formal grammars is a hard problem that balances
between language equivalence (which is known to be undecidable) and
grammar identity (which is trivial). In this paper, we investigate several
milestones between those two extremes and propose a methodology
for inconsistency management in grammar engineering. While conventional
grammar convergence is a practical approach relying on human
experts to encode differences as transformation steps, guided grammar
convergence is a more narrowly applicable technique that infers such
transformation steps automatically by normalising the grammars and
establishing a structural equivalence relation between them. This allows
us to perform a case study of automatically inferring bidirectional
transformations between 11 grammars (in a broad sense) of the same
artificial functional language: parser specifications with different
combinator libraries, definite clause grammars, concrete syntax definitions,
algebraic data types, metamodels, XML schemata, object models.


1 Introduction 

Modern grammar theory has shifted its focus from general purpose programming
languages to a broader scope of software languages that comprise programming
languages, domain specific languages, markup languages, API libraries,
interaction protocols, etc. [12]. Such software languages are specified by
grammars in a broad sense that still rely on the familiar infrastructure of
terminals, nonterminals and production rules, but specify general commitment
to grammatical structure found in software systems. In that sense, a type safe
program commits to a particular type system; a program that uses a library
commits to using its exposed interface; an XML document commits to the
structure defined by its schema. Failure to commit in any of these cases would
mean errors in interpretation of the language entity. These, and many other,
scenarios can be expressed and resolved in terms of grammar technology, but
not all structural commitments profit from a grammatical approach (the most
remarkably problematic ones being indentation policies and naming conventions).

One of the problems of multiple implementations of the same language, which
has been known for many years, is having an abstract syntax definition and a
concrete syntax definition [25]. Basically, the abstract syntax defines the
kind of entities that inhabit the language and must be handled by the
semantics specification. A concrete syntax shows how to write down language
entities and how to read them back. It is not uncommon for a programming
language to have several possible concrete syntaxes: for example, any binary
operation may use prefix, infix or postfix notation, without any changes to
the language semantics. Indeed, we have seen infix dialects of postfix Forth
(Forthwrite, InfixForth) and prefix dialects of infix REBOL (Boron). For
software languages, the problem is broader: we can speak of one intended
language specification and a variety of abstract and concrete syntaxes, data
models, class dictionaries, metamodels, ontologies and similar contracts that
conform to it.

Our definition of the intended language relies on bidirectional
transformations [1,17,22,28] and in particular on their notation by
Meertens [17], which we redefine here for the sake of completeness and clarity:

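Definition 1. For a relation $R \subseteq S \times T$, a semi-maintainer is a
function $\triangleright : S \times T \to T$, such that
$\forall x \in S, \forall y \in T, (x, x \triangleright y) \in R$, and
$\forall x \in S, \forall y \in T, (x, y) \in R \Rightarrow x \triangleright y = y$.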

The first property is called correctness and ensures that the update caused by
the semi-maintainer restores the relation. The second property is
hippocraticness and states that an update has no effect ("does no harm"), if
the original pair is already in the relation [22]. Other properties of
bidirectional transformations such as undoability are often unachievable. A
maintainer is a pair of semi-maintainers $\triangleright$ and $\triangleleft$.
A bidirectional mapping is a relation and its maintainer.
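To make the two properties concrete, here is a minimal Python sketch of a semi-maintainer over a deliberately tiny relation. The relation (non-negative integers related to decimal strings that denote them, leading zeros allowed) and all function names are assumptions made for illustration; they do not come from Meertens [17] or from the prototype.

```python
from itertools import product

# Hypothetical toy relation R between S (non-negative integers) and
# T (strings): (x, y) is in R iff y is a decimal numeral denoting x,
# so both "7" and "007" are related to 7.
def related(x: int, y: str) -> bool:
    return y.isdecimal() and int(y) == x

def maintain(x: int, y: str) -> str:
    """Semi-maintainer S x T -> T for the toy relation above."""
    if related(x, y):
        return y       # hippocraticness: do no harm to a consistent pair
    return str(x)      # otherwise restore the relation canonically

def is_correct(xs, ys) -> bool:
    # Correctness: (x, maintain(x, y)) is always in R.
    return all(related(x, maintain(x, y)) for x, y in product(xs, ys))

def is_hippocratic(xs, ys) -> bool:
    # Hippocraticness: pairs already in R are left untouched.
    return all(maintain(x, y) == y
               for x, y in product(xs, ys) if related(x, y))

xs = range(5)
ys = ["0", "007", "3", "04", "x"]
both_hold = is_correct(xs, ys) and is_hippocratic(xs, ys)
```

Note that `maintain(4, "04")` keeps the non-canonical `"04"`, which is exactly what hippocraticness demands, while `maintain(4, "3")` has to replace its second argument.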

Definition 2. A grammar $G$ conforms to the language intended by the master
grammar $M$, if there exists a bidirectional mapping between instances of
their languages:

$G \models L(M) \iff \exists R \subseteq L(G) \times L(M),\ \exists \triangleright : L(G) \times L(M) \to L(M),\ \exists \triangleleft : L(G) \times L(M) \to L(G)$

Naturally, $G \models L(G)$ holds for any grammar $G$.

For example, consider a concrete syntax $G_c$ of a programming language used
by programmers and an abstract syntax $M = G_a$ used by a software
reengineering tool. We would need $\triangleright$ to produce abstract syntax
trees from parse trees and $\triangleleft$ to propagate changes made by a
reengineering tool back to parse trees. If those can be constructed (examples
of algorithms have been seen [10,23,25]), then $G_c$ conforms to the language
intended by $G_a$. As another example, consider an object model used in a tool
that stores its objects in an external database (XML or relational): the
existence of a bidirectional mapping between entries (trees or tables) in the
database and the objects in memory means that they represent the same intended
language, even though they use very different ways to describe it and one may
be a superlanguage of the other. For a more detailed formalisation and
discussion of bidirectional mappings and grammars, the reader is referred
elsewhere [29,31].

Roadmap. In the following sections, we will briefly present the following
milestones of relationships between languages:

§2. Grammar identity: structural equality of grammars

§3. Nominal equivalence: name-based equivalence of grammars

§4. Structural equivalence: name-agnostic footprint-matching equivalence

§5. Abstract normalisation: structural equivalence of normalised grammars

Then, §6 summarises the proposed method and discusses its evaluation.
Finally, §7 concludes the paper by establishing context and contributions.


2 Grammar identity 


Let us assume that grammars are traditionally defined as quadruples
$G = \langle \mathcal{N}, \mathcal{T}, \mathcal{P}, \mathcal{S} \rangle$ where
their elements are respectively the sets of nonterminal symbols, terminal
symbols, production rules and starting nonterminal symbols.

Definition 3. Grammars $G$ and $G'$ are identical, if and only if all their
components are identical:
$G = G' \iff \mathcal{N} = \mathcal{N}' \land \mathcal{T} = \mathcal{T}' \land \mathcal{P} = \mathcal{P}' \land \mathcal{S} = \mathcal{S}'$.

The definition is trivial, and in practice it is commonly weakened.
For example, many metalanguages allow the right hand sides of rules from
$\mathcal{P}$ to contain disjunction (inner choice), which is known to be
commutative, so it is natural to disregard the order of disjunctive clauses
when comparing grammars: gdt, the "grammar diff tool" used in convergence case
studies [15,32], implements that. However, many grammar manipulation
technologies such as PEG [8] or TXL [2] use ordered choices, so this
optimisation can be perceived as premature. For this reason, we will
explicitly abandon disjunction in later sections.

3 Nominal equivalence 

Since identity can be seen as a trivial bijection, a disciplined weakening of Def. 3 
that works across all grammars in a broad sense, is this: 

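Definition 4. Grammars $G$ and $G'$ are nominally equivalent, if there is a
bijection $\beta$ between their production rules:

$G \approx G' \iff \mathcal{N} = \mathcal{N}' \land \mathcal{T} = \mathcal{T}' \land \mathcal{S} = \mathcal{S}' \land \exists \beta : \mathcal{P} \to \mathcal{P}'$,

$\forall q \in \mathcal{P}', \exists p \in \mathcal{P}, q = \beta(p); \quad \forall p_1, p_2 \in \mathcal{P}, \beta(p_1) = \beta(p_2) \Rightarrow p_1 = p_2$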

Algorithms that are used to construct $\beta$ can be different. For example,
in Popart the metalanguage is designed in such a way that it contains enough
information to generate both abstract and concrete syntaxes [25]. In
TIF-grammars, a concrete syntax specification is annotated with directions on
which nodes need to be folded/unfolded or removed when constructing an
abstract syntax tree (AST) [10]. In Rascal, the implode function that maps
parse trees to ASTs uses names of nonterminals and subexpressions to direct
the automatic construction of $\beta$: for example, an optional nonterminal
occurrence can be mapped to a string, a list or a Boolean; a Kleene star can
be imploded to either a list or a set, etc. [13]. Similar techniques can be
spotted in OOP [4], in MDE [6], in data binding frameworks [7], etc.

Once the bijective $\beta$ is agreed on the level of grammars, we need to
construct a coupled maintainer on the language instance level. If it cannot be
constructed, then $\beta$ is useless for us. However, there are many cases
when the maintainer can be constructed to be bidirectional ($R$ from Def. 2
can be partial and $\triangleright$ and/or $\triangleleft$ can be only
injective but not surjective). An example of that is matching different
representations of lists/sets: "one or more" and "zero or more" Kleene
repetitions are commonly used in syntactic notations, but one can always be
bidirectionally matched to the other. In our prototype implementation, we
disregard unreachable nonterminals, treat built-in and user-defined
nonterminals equally, allow sequence element permutations and desugar
metasyntactic constructs like "separator lists", as well as match different
kinds of lists/sets. This set of strategies was chosen to be valid for nominal
matching techniques, but also useful for the following sections.


4 Structural equivalence 


In order to reuse the methods from the nominal equivalence approach in the
case when nonterminal names do not match, we need to construct an additional
mapping between the nonterminals, and use that instead of nominal identity.
We shall refer to this mapping as nominal resolution. To construct it, we
generalise permutations and construct signatures that express the general
structure of a production rule. Such signatures depend heavily on the
expressiveness of the chosen metalanguage (in particular, when the formal
grammar is defined with just terminals and nonterminals, their use is severely
limited), but most common metalanguages allow signatures to work.

Definition 5. A footprint $\pi_n(x)$ of a nonterminal $n$ in an expression $x$
is defined as a multiset of presence indicators:

$$\pi_n(x) = \begin{cases}
\{1\} & \text{if } x = n \\
\{?\} & \text{if } x = n? \\
\{+\} & \text{if } x = n^+ \\
\{*\} & \text{if } x = n^* \\
\pi_n(y) & \text{if } x = \mathit{name}{:}y \\
\pi_n(e_1) \cup \pi_n(e_2) & \text{if } x = e_1 e_2 \\
\varnothing & \text{otherwise}
\end{cases}$$

Definition 6. Two footprints are equivalent, if they are equal modulo
repetition kinds: $\pi \approx \xi \iff \pi = \xi \lor \pi' = \xi'$, where
$\xi'$ is $\xi$ with all $+$ elements replaced by $*$ elements (and likewise
for $\pi'$).

Note how disjunction is missing from the definition of a footprint. We will
see later how it can be removed from any grammar by factoring and folding.
The identity of footprints follows the standard definition of identity of
multisets (which naturally subsumes abstraction over permutations). We also
define equivalence of footprints, which generalises the treatment of lists
discussed in the previous section. Footprints together form a signature.

Definition 7. A prodsig, or a signature of a production rule $p = (m ::= x)$,
is defined as a set of tuples with nonterminals used in its right hand side
and their footprints:
$\sigma(m ::= x) = \{\langle n, \pi_n(x) \rangle \mid n \in \mathcal{N}, \pi_n(x) \neq \varnothing\}$.

For example, the prodsig of a production rule $P ::= P^+$ is
$\{\langle P, \{+\} \rangle\}$ and the prodsig of $F ::= S S^* E$ is
$\{\langle E, \{1\} \rangle, \langle S, \{1, *\} \rangle\}$.
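Footprints (Def. 5), their equivalence (Def. 6) and prodsigs (Def. 7) translate almost literally into code. The following Python sketch is an illustration under assumptions: the tuple encoding of expressions and all function names are hypothetical, multisets are modelled with `collections.Counter`, and the prototype itself is written in Rascal, not Python.

```python
from collections import Counter

# Hypothetical expression encoding:
#   ("nt", n)                  nonterminal occurrence n
#   ("opt", e)  ("plus", e)  ("star", e)     e?   e+   e*
#   ("label", name, e)         name:e
#   ("seq", e1, e2, ...)       e1 e2 ...
MARK = {"opt": "?", "plus": "+", "star": "*"}

def footprint(n, x):
    """Multiset of presence indicators of nonterminal n in expression x."""
    kind = x[0]
    if kind == "nt":
        return Counter("1") if x[1] == n else Counter()
    if kind in MARK and x[1] == ("nt", n):
        return Counter(MARK[kind])
    if kind == "label":
        return footprint(n, x[2])
    if kind == "seq":
        total = Counter()
        for e in x[1:]:
            total += footprint(n, e)   # multiset union
        return total
    return Counter()                   # the "otherwise" case

def nonterminals(x):
    if x[0] == "nt":
        return {x[1]}
    return set().union(*(nonterminals(e) for e in x[1:] if isinstance(e, tuple)))

def prodsig(rhs):
    """Nonterminals of a right hand side with their non-empty footprints."""
    sig = {n: footprint(n, rhs) for n in nonterminals(rhs)}
    return {n: fp for n, fp in sig.items() if fp}

def weaken(fp):
    # Replace every + indicator by *, keeping multiplicities.
    out = Counter()
    for mark, count in fp.items():
        out["*" if mark == "+" else mark] += count
    return out

def equivalent(fp1, fp2):
    return fp1 == fp2 or weaken(fp1) == weaken(fp2)

rule_p = ("plus", ("nt", "P"))                                    # P ::= P+
rule_f = ("seq", ("nt", "S"), ("star", ("nt", "S")), ("nt", "E"))  # F ::= S S* E
sig_p, sig_f = prodsig(rule_p), prodsig(rule_f)
```

For the two rules from the text this yields one `+` indicator for `P`, and for `F ::= S S* E` the indicators `1` and `*` for `S` plus a single `1` for `E`.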

Definition 8. Two production rules are prodsig-equivalent, if and only if
there is an equivalent match between tuple ranges of their signatures:

$p \Bumpeq q \iff \forall \langle n, \pi \rangle \in \sigma(p), \exists \langle m, \xi \rangle \in \sigma(q), \pi \approx \xi$

Consider a simple case of exactly one production rule taken from each of
the grammars: $p_m$ from the master grammar and $p_s$ from the servant
grammar. Suppose that the left hand sides of them are assumed to match, and we
want to see if the right hand sides are matched nominally as well, and whether
they deliver any new information with respect to nominal resolution. When
prodsigs $\sigma(p_m)$ and $\sigma(p_s)$ are constructed, we have effectively
built relations that bind nonterminal names to their occurrences. By
subsequently matching the ranges of them with either strong or weak
prodsig-equivalence, we can infer nominal matching of the nonterminals by
matching the domains of the relations.

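Definition 9. For any two prodsig-equivalent production rules $p$ and $q$,
$p \Bumpeq q$, there is (at least one) nominal resolution relationship
$p \diamond q$ that satisfies:

$\forall \langle a, b \rangle \in p \diamond q : a = \omega \lor b = \omega \lor \exists \pi, \exists \xi, \pi \approx \xi, \langle a, \pi \rangle \in \sigma(p), \langle b, \xi \rangle \in \sigma(q)$

$\forall \langle c, d \rangle \in p \diamond q : a = c \neq \omega \Rightarrow b = d$,
where $\omega$ denotes unmatched nonterminals.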

For two arbitrarily provided grammars (presumably of the same intended 
software language, but not necessarily admitting any kind of equivalence), we 
cannot claim the existence of only one nominal resolution that works across all 
their production rules, but we can attempt to construct a minimum possible one: 

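Definition 10. Given two grammars $G_1$ and $G_2$, a nominal resolution between
them is a relation between their nonterminals
$\diamond \subseteq \mathcal{N}_1 \times \mathcal{N}_2$ such that
$\forall p_i \in \mathcal{P}_1$, if $\exists q_j \in \mathcal{P}_2, p_i \Bumpeq q_j$,
then $\exists \diamond_{ij} \subseteq \diamond$, such that $p_i \diamond_{ij} q_j$.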

In our case study, we have used various definitions of the same toy functional
language FL taken from [15]. For instance, some grammars were extracted from
object models by analysing Java code. Since the Java implementation of FL
used List<Expr> to represent lists, the production rules for function
declaration and function call assumed zero or more arguments, while the master
grammar assumed one or more. Hence, production rules
$F_1 ::= S_1 S_1^+ E_1$ and $E_1 ::= S_1 E_1^+$ were matched with their
prodsig-equivalent counterparts $F_2 ::= S_2 S_2^* E_2$ and
$E_2 ::= S_2 E_2^*$. All the various grammars of FL, their prodsigs and
nominal matching reports are exposed for inspection in the full report on the
case study [27].
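How nominal matches follow from prodsig-equivalent rules (Defs. 8 and 9) can be sketched in Python. This is an illustration under assumptions rather than the algorithm of the prototype: prodsigs are given as plain dictionaries from nonterminal names to tuples of indicators, the rule and nonterminal names are hypothetical, and unmatched nonterminals (the $\omega$ case) are left out for brevity.

```python
from itertools import permutations

def weaken(fp):
    # Footprint with every + replaced by *, in canonical (sorted) order.
    return sorted("*" if mark == "+" else mark for mark in fp)

def fp_equivalent(fp1, fp2):
    # Equal as multisets, or equal modulo repetition kinds (Def. 6).
    return sorted(fp1) == sorted(fp2) or weaken(fp1) == weaken(fp2)

def prodsig_equivalent(sig_p, sig_q):
    # Def. 8 as stated: every footprint of p has an equivalent one in q.
    return all(any(fp_equivalent(fp, fq) for fq in sig_q.values())
               for fp in sig_p.values())

def resolutions(sig_p, sig_q):
    """Yield every injective matching of the nonterminals of p to those
    of q in which matched nonterminals have equivalent footprints."""
    names_p = sorted(sig_p)
    for names_q in permutations(sorted(sig_q), len(names_p)):
        if all(fp_equivalent(sig_p[a], sig_q[b])
               for a, b in zip(names_p, names_q)):
            yield dict(zip(names_p, names_q))

# Hypothetical master rule   Function ::= Str Str+ Expr
# and servant rule           Fun ::= Name Name* Exp
master = {"Str": ("1", "+"), "Expr": ("1",)}
servant = {"Name": ("1", "*"), "Exp": ("1",)}
unique = list(resolutions(master, servant))

# Two nonterminals with identical footprints admit two matchings.
ambiguous = list(resolutions({"A": ("1",), "B": ("1",)},
                             {"C": ("1",), "D": ("1",)}))
```

The first pair of rules determines the matching uniquely, while the second shows why several candidate matchings may have to be explored before a resolution that works across all production rules is found.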

5 Abstract normalisation 

In order to apply the methodology based on nonterminal footprints, production
signatures and their equivalence relations, we need the input grammars to
comply with some assumptions that have been left informal so far. In
particular, we can foresee possible problems with names/labels for production
rules and subexpressions, terminal symbols (often not a part of the abstract
syntax), disjunction (inner choices, also non-factored), separator lists and
other metasyntactic sugar, non-connected nonterminal call graph, inconsistent
style of production rules, etc. If by $\mathcal{P}_n$ we denote the subset of
production rules concerning one particular nonterminal:
$\mathcal{P}_n = \{p \mid p = n ::= \alpha, \alpha \in (\mathcal{N} \cup \mathcal{T})^*\}$,
then we can define the Abstract Normal Form as follows:

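Definition 11. A grammar
$G = \langle \mathcal{N}, \mathcal{T}, \mathcal{P}, \mathcal{S} \rangle$, where
$\mathcal{T} = \varnothing$ and $\mathcal{S} = \{s\}$, is said to be in
Abstract Normal Form, if and only if:

• $\mathcal{N}$ is decomposable to disjoint sets, such that
$\mathcal{N} = \mathcal{N}_+ \cup \mathcal{N}_- \cup \mathcal{N}_\bot$

• One of them is not empty and includes the root:
$s \in \mathcal{N}_+ \cup \mathcal{N}_-$

• Nonterminals from one subset are undefined:
$n \in \mathcal{N}_\bot \Rightarrow \mathcal{P}_n = \varnothing$

• Nonterminals from one other subset are defined with exactly one rule:
$n \in \mathcal{N}_- \Rightarrow |\mathcal{P}_n| = 1, \mathcal{P}_n = \{n ::= \alpha\}, \alpha \in \ldots$

• Nonterminals from the other subset are defined with chain rules:
$n \in \mathcal{N}_+ \Rightarrow \forall p \in \mathcal{P}_n, p = (n ::= x), x \in \mathcal{N}$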
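The decomposition required by the Abstract Normal Form can be checked mechanically. The Python sketch below is only an illustration under assumptions: the dictionary encoding of grammars, the use of `?`, `+`, `*` suffixes on symbols and the function names are all hypothetical, and a nonterminal with exactly one rule is always placed in the single-rule subset, which is one possible reading of the definition.

```python
# Hypothetical encoding: nonterminal -> list of right hand sides, each a
# tuple of nonterminal names optionally suffixed with ?, + or *.
# There are no terminals, matching the requirement that T is empty.
def base(symbol):
    return symbol.rstrip("?+*")

def is_chain(rhs):
    # A chain rule has a single plain nonterminal on its right hand side.
    return len(rhs) == 1 and rhs[0] == base(rhs[0])

def classify(grammar):
    used = {base(s) for rules in grammar.values() for rhs in rules for s in rhs}
    undefined = {n for n in used | set(grammar) if not grammar.get(n)}
    single, chained, offending = set(), set(), set()
    for n, rules in grammar.items():
        if not rules:
            continue
        if len(rules) == 1:
            single.add(n)        # defined with exactly one rule
        elif all(is_chain(rhs) for rhs in rules):
            chained.add(n)       # defined with chain rules only
        else:
            offending.add(n)     # fits neither subset
    return undefined, single, chained, offending

def is_anf(grammar, root):
    undefined, single, chained, offending = classify(grammar)
    return not offending and root in single | chained

fl_like = {
    "Program": [("Function+",)],
    "Function": [("Str", "Str+", "Expr")],
    "Expr": [("Binary",), ("Apply",), ("Literal",)],
    "Binary": [("Expr", "Op", "Expr")],
    "Apply": [("Str", "Expr+")],
    "Literal": [("Int",)],
}
not_normal = {"E": [("E", "E"), ("N",)]}
```

In the `fl_like` example `Expr` is the only nonterminal defined by chain rules, `Str`, `Op` and `Int` are undefined, and every other nonterminal has exactly one rule; `not_normal` mixes a chain rule with a sequence for the same nonterminal and is therefore rejected.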

In fact, any grammar can be rewritten to assume this form: in our prototype
implementation, this is done by programming a grammar mutation [28,30].
(A grammar mutation is a general intentional grammar change that cannot be
expressed independently of the grammar to which it will be applied. Thus, if
"rename" is a parametric grammar transformation operator, then "rename A
to B" is a transformation, but "rename all nonterminals to uppercase" is a
mutation that is equivalent to transformations like "rename a to A" and
"rename b to B" depending on the input grammar). Our prototype, normal::ANF,
is a metaprogram in Rascal [13] that is available for inspection as open
source [32]. It is in fact a superposition of mutations that address the items
from the definition individually: remove labels, desugar separator lists,
fold/unfold chain production rules, etc.
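The difference between a transformation and a mutation can be made tangible with a short Python sketch. The grammar encoding and the function names are assumptions for illustration and have nothing to do with the Rascal prototype; the sketch also assumes that uppercasing does not make two distinct nonterminal names collide.

```python
# Hypothetical encoding: nonterminal -> list of right hand sides,
# each a tuple of nonterminal names.
def rename(grammar, old, new):
    """Transformation: the parametric operator applied to fixed arguments."""
    def fix(symbol):
        return new if symbol == old else symbol
    return {fix(n): [tuple(fix(s) for s in rhs) for rhs in rules]
            for n, rules in grammar.items()}

def uppercase_all(grammar):
    """Mutation: the concrete steps can only be derived from the input."""
    names = set(grammar) | {s for rules in grammar.values()
                            for rhs in rules for s in rhs}
    return [("rename", n, n.upper()) for n in sorted(names) if n != n.upper()]

def apply_steps(grammar, steps):
    for _, old, new in steps:
        grammar = rename(grammar, old, new)
    return grammar

g = {"a": [("b", "a"), ("b",)], "b": []}
steps = uppercase_all(g)          # derived from g, not known beforehand
result = apply_steps(g, steps)
```

Running the mutation on `g` produces the steps "rename a to A" and "rename b to B"; on a different grammar the same mutation would yield a different list of transformation steps.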

All the rewritings performed by transforming a grammar to its ANF are assumed
to be monadic in the sense of not only normalising the grammar, but also
yielding a bidirectional grammar transformation chain whose execution would
normalise the grammar. (In our implementation, these steps are specified in
the ΞBGF language, primarily because no other bidirectional grammar
transformation operator suite exists [30]). The bidirectional grammar
transformation chain can then be coupled to the bidirectional mapping between
language instances from Def. 2, with the methodology described in [28]. This
is required for traceability: the conversion to ANF is one of the steps to
achieve automated convergence, not a one-way preprocessing.

6 Discussion 

To summarise, grammar convergence is a technique of relating different
grammars in a broad sense of the same intended software language [15]. It
relies on the transformations being programmed by an experienced grammar
engineer: even beside the required expertise, the process is not incremental,
since the transformation steps need to be considered carefully and constructed
for each new grammar added to the mix. With the definitions from previous
sections, we have described the process of guided grammar convergence, where
the master grammar of the intended language is constructed once, and the
transformations are inferred for any directly available grammars as well as
for the ones possibly added in the future. The process works as follows.

• Extract pure grammatical knowledge from the grammar source.

• Use grammar mutations to preprocess your grammars, if necessary.

• Normalise the grammar by removing all problematic/ambiguous constructs.

• Start by matching the roots of the connected normalised grammar.

• Match multiple production rules by prodsig-equivalence; infer new nominal
matches by matching equivalent prodsigs. Repeat for all nonterminals.

• If several matches are possible, explore all and fall back in case of
failure. If a global nominal resolution scheme was impossible to infer, fail.

• Resolve structural differences in the production rules that matched
nominally.

To evaluate the method of guided grammar convergence, we have applied
it to a case study of 11 different grammars of the same intended functional
language that was defined and used earlier in order to demonstrate the
original grammar convergence method that converged 5 of these grammars. The
following grammar sources were used (all of them are available in the
repository of SLPS [32] together with their evolution history and authorship
attribution):

adt: an algebraic data type¹ in Rascal [13];

antlr: a parser description in the input language of ANTLR [19], with semantic
actions (in Java) intertwined with EBNF-like productions;

dcg: a logic program written in the style of definite clause grammars [20];

emf: an Ecore model, automatically generated by Eclipse [5] from the XML
Schema of the xsd source;

jaxb: an object model obtained by a data binding framework, generated
automatically by JAXB [7] from the XML schema for FL;

om: a hand-crafted object model (Java classes) for the abstract syntax of FL;

python: a parser specification in a scripting language, using the PyParsing
library [16];

rascal: a concrete syntax specification in the metaprogramming language of the
Rascal language workbench [13];

sdf: a concrete syntax definition in the notation of SDF [11] with scannerless
generalized LR parsing as the parsing model;

txl: a concrete syntax definition in the notation of the TXL (Turing eXtender
Language) transformational framework [2], which, unlike SDF, uses a
combination of pattern matching and term rewriting;

xsd: an XML schema [9] for the abstract syntax of FL.

The complete case study is too big to be presented here; interested readers
are referred to a 40+ page long report [27], containing all production rules,
signatures, matchings and transformations. The case study was successful: on
average ANF was achieved after 20-30 transformation steps, nominal resolution
took up to 9 (proportional to the number of nonterminals) and structural
resolution needed 0-5 more steps, after which all 11 grammars were converged.

ANTLR, DCG and PyParsing used layered definitions and therefore were the
only three grammars to require mutation (another 2-6 steps). The case study is
available for investigation and replication both in the form of Rascal
metaprograms at http://github.com/grammarware/slps [32] (the main algorithm is
located in the converge::Guided module, which can be observed and modified
at shared/rascal/src/converge/Guided.rsc) and as a PDF report with all
grammars and transformations pretty-printed automatically [27].

¹ http://tutor.rascal-mpl.org/Rascal/Declarations/AlgebraicDataType/AlgebraicDataType.html

Guided grammar convergence is a methodology stemming from the grammarware
"technological space" [14]. When looking for similar techniques in other
spaces (engaging in "space travel"), the obvious candidates are schema matching
and data integration in the field of data modeling and databases [21]; comparison
of UML models or metamodels in the context of model-driven engineering [6,26];
model weaving for product line software development [3,24]; computation of
refactorings from different OO program versions [4,18]; etc. For example, Del
Fabro and Valduriez [3] utilised metamodel properties for automatically
producing weaving models. The core difference is that (meta)model weaving
ultimately aims at incorporating all the changes into the resulting
(meta)model, while guided grammar convergence also makes complete sense when
some changes in the details are disregarded. The lowest limit in this process
is needed (otherwise additional claims on the minimality of inferred
transformations are required), and we specify this lowest limit as the master
grammar. Another difference is that model weaving rarely involves more than
two models, while even our little case study of guided convergence had 10+
grammars in it. In general, prodsig-based matching is more lightweight than
those methods, since it in fact compares straightforwardly structured
prodsigs: it easily wins in performance and implementability but loses in
applicability to complex scenarios.

7 Conclusion 

We knew that language equivalence is undecidable and that grammar identity is 
trivial. In this paper, we have attempted to reach a useful level of reasoning about 
language relationships by departing from grammar identity as the “easy” side of 
the spectrum. This was done in the scope of grammar convergence, when several 
implementations of the same software language are inspected for compatibility. 

A definition of an intended language was provided (Def. 2) based on a
bidirectional transformation between language entities. Then we revisited
existing and possible techniques of structural matching that assumed nominal
identity (Def. 3). In order to automatically infer nominal matching, we
introduced nonterminal footprints (Def. 5), production signatures (Def. 7) and
various degrees of equivalence among them. An extensive normalisation scheme
(Def. 11) was proposed to transform any given grammar into the form most
suitable for nominal and then structural matching. It has been explained that
when such normalisation is not enough, a more targeted yet still automated
approach is needed with grammar mutation strategies making the method robust
with respect to different grammar design decisions, such as the use of layers
instead of priorities or recursion instead of iteration. Just as all other
parts of the proposed process, these mutations operate automatically and do
not require human intervention.

A case study was used to evaluate the proposed method of guided grammar
convergence. The experiment concerned several implementations of a simple
functional language in ANTLR, DCG, Ecore, Java, Python, Rascal, SDF, TXL and
XML Schema. The diversity in language processing frameworks (metaprogramming
languages, declarative specifications, syntax definitions, algebraic data
types, parsing libraries, transformation frameworks, software models, parser
definitions) was intentional and aimed at stressing the definition of the
intended language and the guided convergence method. Casting all grammars from
our case study to ANF allowed us to make inference quicker and with fewer
obstacles, as well as to explain the process more clearly.

All artifacts discussed on the pages of this paper are transparently available
to the public through a GitHub repository [32]. For each of the sources of the
case study, one could inspect the original file, the extracted grammar, the
extractor itself, the mutations that have been derived and applied, the
normalisations to ANF, the normalised grammar, the nominal resolution and
reasons for each match, as well as the structural resolution steps. One could
also investigate the implementation of the method of guided grammar
convergence, the algorithm for calculating prodsigs and the process of
convergence. Supplementary material contains 40+ pages of the full report,
also generated by our prototype [27].

On the practical side, guided grammar convergence provides a balanced
method of grammar manipulation, positioned right between unstructured inline
editing (which makes grammar development very much like software development
but lacks important properties such as traceability and reproducibility)
and strictly exogenous functional transformation (which requires substantially
more effort but is robust, repeatable and exposes the intended semantics). Its
future role can be seen as support for grammar product lines that allows
both steady adaptation plans for deriving secondary artifacts from the
reference grammar, and occasional inline editing of the derived artifacts with
subsequent automated restoration of the adaptation scripts. This is a
contribution to the engineering discipline for grammarware [12].